{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "# 🪙 American Express - Default Prediction Competition Kale Pipeline\n",
    "![](./images/background.jpg)\n",
    "\n",
    "---\n",
    "\n",
    "In this [Kaggle competition](https://www.kaggle.com/competitions/g-research-crypto-forecasting/overview), you'll use your machine learning expertise to predict credit default. This competition is hosted by American Express. \n",
    "\n",
    "> American Express is a globally integrated payments company. The largest payment card issuer in the world, they provide customers with access to products, insights, and experiences that enrich lives and build business success.\n",
    "\n",
    "The dataset provided is an industrial scale data set of about 5.5 million rows. It has been pre-processed and converted to a lightweight version by raddar for ease of training and better result. This dataset is available in a [parquet format][1].\n",
    "\n",
    "[1]: https://www.kaggle.com/datasets/raddar/amex-data-integer-dtypes-parquet-format"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Install necessary packages\n",
    "\n",
    "We can install the necessary package by either running pip install --user <package_name> or include everything in a requirements.txt file and run pip install --user -r requirements.txt. We have put the dependencies in a requirements.txt file so we will use the former method.\n",
    "\n",
    "NOTE: After installing python packages, restart notebook kernel before proceeding."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "tags": [
     "skip"
    ]
   },
   "outputs": [],
   "source": [
    "!pip install -r requirements.txt --user --quiet"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Imports\n",
    "\n",
    "In this section we import the packages we need for this example. Make it a habit to gather your imports in a single place. It will make your life easier if you are going to transform this notebook into a Kubeflow pipeline using Kale."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "tags": [
     "imports"
    ]
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import os, subprocess\n",
    "import random, zipfile, joblib\n",
    "import scipy.stats\n",
    "import warnings\n",
    "import gc, wget\n",
    "\n",
    "from sklearn.model_selection import StratifiedKFold\n",
    "from lightgbm import LGBMClassifier\n",
    "\n",
    "warnings.filterwarnings(\"ignore\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Project hyper-parameters\n",
    "\n",
    "In this cell, we define the different hyper-parameters. Defining them in one place makes it easier to experiment with their values and also facilitates the execution of HP Tuning experiments using Kale and Katib."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "tags": [
     "pipeline-parameters"
    ]
   },
   "outputs": [],
   "source": [
    "# Hyper-parameters\n",
    "N_EST = 30\n",
    "LR = 0.1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "Set random seed for reproducibility"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "tags": [
     "skip"
    ]
   },
   "outputs": [],
   "source": [
    "def fix_all_seeds(seed):\n",
    "    np.random.seed(seed)\n",
    "    random.seed(seed)\n",
    "    os.environ['PYTHONHASHSEED'] = str(seed)\n",
    "\n",
    "fix_all_seeds(2022)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Download data\n",
    "\n",
    "In this section, we download the data from kaggle using the Kaggle API credentials"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "block:download_data"
    ]
   },
   "outputs": [],
   "source": [
    "# setup kaggle environment for data download\n",
    "dataset = \"amex-data-integer-dtypes-parquet-format\"\n",
    "\n",
    "# setup kaggle environment for data download\n",
    "with open('/secret/kaggle-secret/password', 'r') as file:\n",
    "    kaggle_key = file.read().rstrip()\n",
    "with open('/secret/kaggle-secret/username', 'r') as file:\n",
    "    kaggle_user = file.read().rstrip()\n",
    "\n",
    "os.environ['KAGGLE_USERNAME'], os.environ['KAGGLE_KEY'] = kaggle_user, kaggle_key\n",
    "\n",
    "# download kaggle's Amex-credit-prediction data\n",
    "subprocess.run([\"kaggle\",\"datasets\", \"download\", \"-d\", f'raddar/{dataset}'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "block:"
    ]
   },
   "outputs": [],
   "source": [
    "# path to download to\n",
    "data_path = 'data'\n",
    "\n",
    "# extract Amex-credit-prediction.zip to data_path\n",
    "with zipfile.ZipFile(f\"{dataset}.zip\",\"r\") as zip_ref:\n",
    "    zip_ref.extractall(data_path)\n",
    "    \n",
    "# download kaggle's Amex-credit-prediction train_labels.zip\n",
    "download_link = \"https://github.com/kubeflow/examples/blob/master/american-express-default-kaggle-competition/data/train_labels.zip?raw=true\"\n",
    "wget.download(download_link, f'{data_path}/train_labels.zip')\n",
    "\n",
    "# extract Amex-credit-prediction.zip to data_path\n",
    "with zipfile.ZipFile(f'{data_path}/train_labels.zip','r') as zip_ref:\n",
    "    zip_ref.extractall(data_path)\n",
    "    \n",
    "# delete zipfiles\n",
    "subprocess.run(['rm', f'{dataset}.zip'])\n",
    "subprocess.run(['rm', f'{data_path}/train_labels.zip'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Load the dataset\n",
    "\n",
    "First, let us load and analyze the data.\n",
    "\n",
    "The data is in csv format, thus, we use the handy read_csv pandas method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "tags": [
     "block:load_data",
     "prev:download_data"
    ]
   },
   "outputs": [],
   "source": [
    "TRAIN_CSV = (f'{data_path}/train.parquet')\n",
    "TEST_CSV = f'{data_path}/test.parquet'\n",
    "TARGET_CSV = f'{data_path}/train_labels.csv'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "target shape: (458913,)\n"
     ]
    }
   ],
   "source": [
    "df_train = pd.read_parquet(TRAIN_CSV)\n",
    "df_test = pd.read_parquet(TEST_CSV)\n",
    "target = pd.read_csv(TARGET_CSV).target.values\n",
    "print(f\"target shape: {target.shape}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(5531451, 190)"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_train.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "customer_ID          0\n",
       "S_2                  0\n",
       "P_2              45985\n",
       "D_39                 0\n",
       "B_1                  0\n",
       "                ...   \n",
       "D_141           101548\n",
       "D_142          4587043\n",
       "D_143                0\n",
       "D_144            40727\n",
       "D_145                0\n",
       "Length: 190, dtype: int64"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_train.isna().sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Define Helper Functions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "tags": [
     "functions"
    ]
   },
   "outputs": [],
   "source": [
    "# @yunchonggan's fast metric implementation\n",
    "# From https://www.kaggle.com/competitions/amex-default-prediction/discussion/328020\n",
    "def amex_metric(y_true: np.array, y_pred: np.array) -> float:\n",
    "\n",
    "    # count of positives and negatives\n",
    "    n_pos = y_true.sum()\n",
    "    n_neg = y_true.shape[0] - n_pos\n",
    "\n",
    "    # sorting by descring prediction values\n",
    "    indices = np.argsort(y_pred)[::-1]\n",
    "    preds, target = y_pred[indices], y_true[indices]\n",
    "\n",
    "    # filter the top 4% by cumulative row weights\n",
    "    weight = 20.0 - target * 19.0\n",
    "    cum_norm_weight = (weight / weight.sum()).cumsum()\n",
    "    four_pct_filter = cum_norm_weight <= 0.04\n",
    "\n",
    "    # default rate captured at 4%\n",
    "    d = target[four_pct_filter].sum() / n_pos\n",
    "\n",
    "    # weighted gini coefficient\n",
    "    lorentz = (target / n_pos).cumsum()\n",
    "    gini = ((lorentz - cum_norm_weight) * weight).sum()\n",
    "\n",
    "    # max weighted gini coefficient\n",
    "    gini_max = 10 * n_neg * (1 - 19 / (n_pos + 20 * n_neg))\n",
    "\n",
    "    # normalized weighted gini coefficient\n",
    "    g = gini / gini_max\n",
    "\n",
    "    return 0.5 * (g + d)\n",
    "\n",
    "def lgb_amex_metric(y_true, y_pred):\n",
    "    \"\"\"The competition metric with lightgbm's calling convention\"\"\"\n",
    "    return ('amex_metric_score',\n",
    "            amex_metric(y_true, y_pred),\n",
    "            True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Feature Engineering"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "tags": [
     "block:feature_engineering",
     "prev:load_data"
    ]
   },
   "outputs": [],
   "source": [
    "# feature engineering gotten from https://www.kaggle.com/code/ambrosm/amex-lightgbm-quickstart\n",
    "def get_features(df, \n",
    "                 features_avg, \n",
    "                 features_min, \n",
    "                 features_max, \n",
    "                 features_last\n",
    "                ):\n",
    "    '''\n",
    "    This function takes a dataframe with all features and returns the aggregated feature grouped by the customer id.\n",
    "    \n",
    "    df - dataframe\n",
    "    '''\n",
    "    cid = pd.Categorical(df.pop('customer_ID'), ordered=True) # get customer id\n",
    "    last = (cid != np.roll(cid, -1)) # mask for last statement of every customer\n",
    "   \n",
    "    df_avg = (df\n",
    "              .groupby(cid)\n",
    "              .mean()[features_avg]\n",
    "              .rename(columns={f: f\"{f}_avg\" for f in features_avg})\n",
    "             ) \n",
    "    \n",
    "    df_min = (df\n",
    "              .groupby(cid)\n",
    "              .min()[features_min]\n",
    "              .rename(columns={f: f\"{f}_min\" for f in features_min})\n",
    "             )\n",
    "    gc.collect()\n",
    "    print('Computed min')\n",
    "    \n",
    "    df_max = (df\n",
    "              .groupby(cid)\n",
    "              .max()[features_max]\n",
    "              .rename(columns={f: f\"{f}_max\" for f in features_max})\n",
    "             )\n",
    "    gc.collect()\n",
    "    print('Computed max')\n",
    "    \n",
    "    df = (df.loc[last, features_last]\n",
    "          .rename(columns={f: f\"{f}_last\" for f in features_last})\n",
    "          .set_index(np.asarray(cid[last]))\n",
    "         )\n",
    "    gc.collect()\n",
    "    print('Computed last')\n",
    "    \n",
    "    df_ = pd.concat([df, df_min, df_max, df_avg], axis=1, )\n",
    "    \n",
    "    del df, df_avg, df_min, df_max, cid, last\n",
    "    \n",
    "    return df_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "features_avg = ['B_1', 'B_2', 'B_3', 'B_4', 'B_5', 'B_6', 'B_8', 'B_9', 'B_10', 'B_11', 'B_12', 'B_13', 'B_14', 'B_15', \n",
    "                'B_16', 'B_17', 'B_18', 'B_19', 'B_20', 'B_21', 'B_22', 'B_23', 'B_24', 'B_25', 'B_28', 'B_29', 'B_30', \n",
    "                'B_32', 'B_33', 'B_37', 'B_38', 'B_39', 'B_40', 'B_41', 'B_42', 'D_39', 'D_41', 'D_42', 'D_43', 'D_44', \n",
    "                'D_45', 'D_46', 'D_47', 'D_48', 'D_50', 'D_51', 'D_53', 'D_54', 'D_55', 'D_58', 'D_59', 'D_60', 'D_61', \n",
    "                'D_62', 'D_65', 'D_66', 'D_69', 'D_70', 'D_71', 'D_72', 'D_73', 'D_74', 'D_75', 'D_76', 'D_77', 'D_78', \n",
    "                'D_80', 'D_82', 'D_84', 'D_86', 'D_91', 'D_92', 'D_94', 'D_96', 'D_103', 'D_104', 'D_108', 'D_112', 'D_113', \n",
    "                'D_114', 'D_115', 'D_117', 'D_118', 'D_119', 'D_120', 'D_121', 'D_122', 'D_123', 'D_124', 'D_125', 'D_126', \n",
    "                'D_128', 'D_129', 'D_131', 'D_132', 'D_133', 'D_134', 'D_135', 'D_136', 'D_140', 'D_141', 'D_142', 'D_144', \n",
    "                'D_145', 'P_2', 'P_3', 'P_4', 'R_1', 'R_2', 'R_3', 'R_7', 'R_8', 'R_9', 'R_10', 'R_11', 'R_14', 'R_15', 'R_16', \n",
    "                'R_17', 'R_20', 'R_21', 'R_22', 'R_24', 'R_26', 'R_27', 'S_3', 'S_5', 'S_6', 'S_7', 'S_9', 'S_11', 'S_12', 'S_13', \n",
    "                'S_15', 'S_16', 'S_18', 'S_22', 'S_23', 'S_25', 'S_26']\n",
    "features_min = ['B_2', 'B_4', 'B_5', 'B_9', 'B_13', 'B_14', 'B_15', 'B_16', 'B_17', 'B_19', 'B_20', 'B_28', 'B_29', 'B_33', 'B_36', \n",
    "                'B_42', 'D_39', 'D_41', 'D_42', 'D_45', 'D_46', 'D_48', 'D_50', 'D_51', 'D_53', 'D_55', 'D_56', 'D_58', 'D_59', \n",
    "                'D_60', 'D_62', 'D_70', 'D_71', 'D_74', 'D_75', 'D_78', 'D_83', 'D_102', 'D_112', 'D_113', 'D_115', 'D_118', 'D_119', \n",
    "                'D_121', 'D_122', 'D_128', 'D_132', 'D_140', 'D_141', 'D_144', 'D_145', 'P_2', 'P_3', 'R_1', 'R_27', 'S_3', 'S_5', \n",
    "                'S_7', 'S_9', 'S_11', 'S_12', 'S_23', 'S_25']\n",
    "features_max = ['B_1', 'B_2', 'B_3', 'B_4', 'B_5', 'B_6', 'B_7', 'B_8', 'B_9', 'B_10', 'B_12', 'B_13', 'B_14', 'B_15', 'B_16', 'B_17', \n",
    "                'B_18', 'B_19', 'B_21', 'B_23', 'B_24', 'B_25', 'B_29', 'B_30', 'B_33', 'B_37', 'B_38', 'B_39', 'B_40', 'B_42', 'D_39', \n",
    "                'D_41', 'D_42', 'D_43', 'D_44', 'D_45', 'D_46', 'D_47', 'D_48', 'D_49', 'D_50', 'D_52', 'D_55', 'D_56', 'D_58', 'D_59', \n",
    "                'D_60', 'D_61', 'D_63', 'D_64', 'D_65', 'D_70', 'D_71', 'D_72', 'D_73', 'D_74', 'D_76', 'D_77', 'D_78', 'D_80', 'D_82', \n",
    "                'D_84', 'D_91', 'D_102', 'D_105', 'D_107', 'D_110', 'D_111', 'D_112', 'D_115', 'D_116', 'D_117', 'D_118', 'D_119', \n",
    "                'D_121', 'D_122', 'D_123', 'D_124', 'D_125', 'D_126', 'D_128', 'D_131', 'D_132', 'D_133', 'D_134', 'D_135', 'D_136', \n",
    "                'D_138', 'D_140', 'D_141', 'D_142', 'D_144', 'D_145', 'P_2', 'P_3', 'P_4', 'R_1', 'R_3', 'R_5', 'R_6', 'R_7', 'R_8', \n",
    "                'R_10', 'R_11', 'R_14', 'R_17', 'R_20', 'R_26', 'R_27', 'S_3', 'S_5', 'S_7', 'S_8', 'S_11', 'S_12', 'S_13', 'S_15', 'S_16', \n",
    "                'S_22', 'S_23', 'S_24', 'S_25', 'S_26', 'S_27']\n",
    "features_last = ['B_1', 'B_2', 'B_3', 'B_4', 'B_5', 'B_6', 'B_7', 'B_8', 'B_9', 'B_10', 'B_11', 'B_12', 'B_13', 'B_14', 'B_15', 'B_16', \n",
    "                 'B_17', 'B_18', 'B_19', 'B_20', 'B_21', 'B_22', 'B_23', 'B_24', 'B_25', 'B_26', 'B_28', 'B_29', 'B_30', 'B_32', 'B_33', \n",
    "                 'B_36', 'B_37', 'B_38', 'B_39', 'B_40', 'B_41', 'B_42', 'D_39', 'D_41', 'D_42', 'D_43', 'D_44', 'D_45', 'D_46', 'D_47', \n",
    "                 'D_48', 'D_49', 'D_50', 'D_51', 'D_52', 'D_53', 'D_54', 'D_55', 'D_56', 'D_58', 'D_59', 'D_60', 'D_61', 'D_62', 'D_63', \n",
    "                 'D_64', 'D_65', 'D_69', 'D_70', 'D_71', 'D_72', 'D_73', 'D_75', 'D_76', 'D_77', 'D_78', 'D_79', 'D_80', 'D_81', 'D_82', \n",
    "                 'D_83', 'D_86', 'D_91', 'D_96', 'D_105', 'D_106', 'D_112', 'D_114', 'D_119', 'D_120', 'D_121', 'D_122', 'D_124', 'D_125', \n",
    "                 'D_126', 'D_127', 'D_130', 'D_131', 'D_132', 'D_133', 'D_134', 'D_138', 'D_140', 'D_141', 'D_142', 'D_145', 'P_2', 'P_3', \n",
    "                 'P_4', 'R_1', 'R_2', 'R_3', 'R_4', 'R_5', 'R_6', 'R_7', 'R_8', 'R_9', 'R_10', 'R_11', 'R_12', 'R_13', 'R_14', 'R_15', \n",
    "                 'R_19', 'R_20', 'R_26', 'R_27', 'S_3', 'S_5', 'S_6', 'S_7', 'S_8', 'S_9', 'S_11', 'S_12', 'S_13', 'S_16', 'S_19', 'S_20', \n",
    "                 'S_22', 'S_23', 'S_24', 'S_25', 'S_26', 'S_27']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# apply feature engineering function\n",
    "train = get_features(df_train, features_avg, features_min, features_max, features_last)\n",
    "test = get_features(df_test, features_avg, features_min, features_max, features_last)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# check null values\n",
    "train.isna().any()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Modelling: StratifiedKFold\n",
    "\n",
    "We cross-validate with a six-fold StratifiedKFold to handle the imbalanced nature of the target.\n",
    "\n",
    "Lightgbm handles null values efficiently."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "block:modelling",
     "prev:feature_engineering"
    ]
   },
   "outputs": [],
   "source": [
    "# Cross-validation\n",
    "\n",
    "features = [f for f in train.columns if f != 'customer_ID' and f != 'target']\n",
    "\n",
    "print(f\"{len(features)} features\")\n",
    "\n",
    "score_list = [] # lgbm score per fold\n",
    "y_pred_list = [] # fold predictions list\n",
    "\n",
    "# init StratifiedKFold\n",
    "kf = StratifiedKFold(n_splits=4)\n",
    "\n",
    "for fold, (idx_tr, idx_va) in enumerate(kf.split(train, target)):\n",
    "    \n",
    "    X_tr, X_va, y_tr, y_va, model = None, None, None, None, None\n",
    "\n",
    "    X_tr = train.iloc[idx_tr][features]\n",
    "    X_va = train.iloc[idx_va][features]\n",
    "    y_tr = target[idx_tr]\n",
    "    y_va = target[idx_va]\n",
    "    \n",
    "    # init model\n",
    "    model = LGBMClassifier(n_estimators=int(N_EST),\n",
    "                          learning_rate=float(LR), \n",
    "                          random_state=2022)\n",
    "    # fit model\n",
    "    model.fit(X_tr, y_tr,\n",
    "              eval_set = [(X_va, y_va)], \n",
    "              eval_metric=[lgb_amex_metric],\n",
    "              verbose = 20,\n",
    "              early_stopping_rounds=30)\n",
    "    \n",
    "    X_tr, y_tr = None, None\n",
    "    \n",
    "    # fold validation set predictions\n",
    "    y_va_pred = model.predict_proba(X_va, raw_score=True)\n",
    "    \n",
    "    # model score\n",
    "    score = amex_metric(y_va, y_va_pred)\n",
    "\n",
    "    print(f\"Score = {score}\")\n",
    "    score_list.append(score)\n",
    "    \n",
    "    # test set predictions\n",
    "    y_pred_list.append(model.predict_proba(test[features], raw_score=True))\n",
    "    \n",
    "    print(f\"Fold {fold}\") \n",
    "\n",
    "# save model\n",
    "joblib.dump(model, 'lgb.jl')\n",
    "print(f\"OOF Score:                       {np.mean(score_list):.5f}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# feature importance for top 30 features\n",
    "fea_imp = pd.DataFrame({'imp':model.feature_importances_, 'col': features})\n",
    "fea_imp = fea_imp.sort_values(['imp', 'col'], ascending=[True, False]).iloc[-30:]\n",
    "_ = fea_imp.plot(kind='barh', x='col', y='imp', figsize=(20, 10))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Evaluation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "block:evaluation_result",
     "prev:modelling"
    ]
   },
   "outputs": [],
   "source": [
    "model = joblib.load('lgb.jl')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "binary_logloss = model.booster_.best_score.get('valid_0').get('binary_logloss')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "amex_metric_score = model.booster_.best_score.get('valid_0').get('amex_metric_score')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "pipeline-metrics"
    ]
   },
   "outputs": [],
   "source": [
    "print(binary_logloss)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "pipeline-metrics"
    ]
   },
   "outputs": [],
   "source": [
    "print(amex_metric_score)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Submission"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "skip"
    ]
   },
   "outputs": [],
   "source": [
    "sub = pd.DataFrame({'customer_ID': test.index,\n",
    "                    'prediction': np.mean(y_pred_list, axis=0)})\n",
    "sub.to_csv('submission.csv', index=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "skip"
    ]
   },
   "outputs": [],
   "source": [
    "sub"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "kubeflow_notebook": {
   "autosnapshot": true,
   "experiment": {
    "id": "new",
    "name": "american-express-defaul-prediction"
   },
   "experiment_name": "american-express-defaul-prediction",
   "katib_metadata": {
    "algorithm": {
     "algorithmName": "grid"
    },
    "maxFailedTrialCount": 3,
    "maxTrialCount": 12,
    "objective": {
     "objectiveMetricName": "",
     "type": "minimize"
    },
    "parallelTrialCount": 3,
    "parameters": []
   },
   "katib_run": false,
   "pipeline_description": "predicting credit default",
   "pipeline_name": "american-express-defaul-prediction-pipeline",
   "snapshot_volumes": true,
   "steps_defaults": [
    "label:access-ml-pipeline:true",
    "label:kaggle-secret:true",
    "label:access-rok:true"
   ],
   "volume_access_mode": "rwm",
   "volumes": [
    {
     "annotations": [],
     "mount_point": "/home/jovyan",
     "name": "test-workspace-qtvmt",
     "size": 32,
     "size_type": "Gi",
     "snapshot": false,
     "type": "clone"
    }
   ]
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
